Skip to content

fix(workflow-engine): only commit step state after success#5010

Merged
abcxff merged 1 commit into
mainfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success
May 19, 2026
Merged

fix(workflow-engine): only commit step state after success#5010
abcxff merged 1 commit into
mainfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 9, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5010 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-inspector 🕒 Building (View Logs) Web May 19, 2026 at 8:12 pm
frontend-cloud 🕐 Queued (View Logs) Web May 19, 2026 at 8:11 pm
kitchen-sink 🕐 Queued (View Logs) Web May 19, 2026 at 8:11 pm
website 😴 Sleeping (View Logs) Web May 19, 2026 at 4:44 pm
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 3:40 pm

Copy link
Copy Markdown
Contributor Author

abcxff commented May 9, 2026

@abcxff abcxff changed the title fix(workflow-engine): only commit step state after success [slop]fix(workflow-engine): only commit step state after success May 9, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: fix(workflow-engine): only commit step state after success

Note: This PR is still in Draft status with an empty description/checklist.

Overview

This PR fixes a correctness bug in workflow-engine where actor state mutations inside a failing step were being persisted even though the step needs to be retried. On retry, the step would execute against incorrectly mutated state. The fix:

  1. Introduces RAW_STATE_SYMBOL to bypass the write-through proxy and read raw actor state.
  2. Wraps every step() and tryStep() call in #withActorAccessAndStateRollback, which snapshots state+vars before the step and rolls them back on any error.
  3. Removes the per-branch flushStorage() calls in step failure catch blocks, consolidating to the single flush in setRetryState / executeWorkflow's unrecoverable-error path.
  4. Removes the stepData.error fallback (justified by the new test).

The core idea is correct and addresses a real workflow replay bug.


Bug: Breaks stateless actors using workflow steps

withActorAccessAndStateRollback unconditionally calls this.#runCtx[RAW_STATE_SYMBOL]() (workflow/context.ts:618), but ActorContextHandleAdapter[RAW_STATE_SYMBOL] throws stateNotEnabledError() when the actor has no state configured (native.ts:2460-2465).

This means any actor that uses workflow steps but does not have a state field will crash on the snapshot call, breaking all ctx.step() invocations regardless of whether the step touches state. The guard needs to handle the state-disabled case, for example by returning undefined and skipping the rollback assignment when #stateEnabled is false.

Concern: structuredClone(vars) can throw for non-cloneable objects

vars are runtime-in-memory (not persisted), so they can legitimately hold class instances, Promises, or WeakRefs. structuredClone throws a DataCloneError on these, crashing the snapshot before the step runs. Worth documenting that vars must be structured-cloneable when workflows are used, or guarding the clone.

Minor: Notification fires before flush for timeout/critical errors

Previously flushStorage() was called before notifyStepError for StepTimeoutError/CriticalError. Now the notify fires before the eventual top-level flush in executeWorkflow. This slightly widens the crash window between notification and persistence (duplicate notification possible on replay). Not a durability regression, but worth a comment.


Code Quality

Positive changes:

  • Consolidating entry.kind.data.error and entry.dirty = true before all the if-branches eliminates three identical blocks.
  • Removing the double-flush on failure (catch block plus setRetryState) is correct; the top-level flush in executeWorkflow guarantees persistence on all error paths.
  • Removing the stepData.error fallback is justified since the new test confirms the error is always written to the step entry.

Issues:

  1. Misleading test name (steps.test.ts:573): 'should not commit step error data to entry on failure' but the test asserts expect(entry.kind.data.error).toBe('Error: step failed'), i.e. the error is committed. Rename to 'should commit error but not output to step entry on failure'.

  2. Bare catch {} in the new test's first-run block silently swallows unexpected errors. Add a comment explaining what error is expected.

  3. // @ts-nocheck in workflow/context.ts: the RAW_STATE_SYMBOL call is unverified by the type system. The interface declares the method returning TState, with TState = never for stateless actors the type system would surface the mismatch if checking were enabled.

  4. NativeConnAdapter[RAW_STATE_SYMBOL] (native.ts:1162): added for interface completeness. The workflow rollback path only flows through ActorContextHandleAdapter, so this is dead code in practice. Fine as-is.


Test Coverage

  • New test is a good regression guard for the flush behavior; see naming concern above.
  • Removed StripStepHistoryErrorDriver test is a coherent removal - it tested the stepData.error fallback that was deliberately deleted.
  • Missing: no test covers stateless actor plus workflow steps. Once the crash is fixed, a test like 'should not crash when actor has no state and step fails' would prevent regression.
  • Loop expectation change (toBe(2) to toBe(3)) appears correct: crash at iteration 3, state now only persisted after successful Loop.continue at end of iteration 2, so iterations 3-5 need to re-execute on resume.

Summary

Area Finding
Core logic Correct - rollback on step failure fixes a real replay bug
Bug Stateless actors with workflow steps crash on RAW_STATE_SYMBOL() before step runs
Concern structuredClone(vars) throws if vars contains non-cloneable objects
Minor Notification fires before flush for exhausted/timeout errors (slightly wider crash window)
Test name 'should not commit step error data' contradicts what the test actually asserts
PR hygiene Description, type checkboxes, and test description are all empty; PR is still in draft

The core approach is sound. Please address the stateless actor crash (and consider the vars clone concern) before merging.

@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 11, 2026 03:41
@abcxff abcxff force-pushed the graphite-base/5010 branch from 56af1d1 to aac9634 Compare May 12, 2026 12:55
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from cc1baa6 to 9304f04 Compare May 12, 2026 12:55
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan May 12, 2026 12:55
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 12, 2026 12:56 Destroyed
@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 12, 2026 13:21
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the graphite-base/5010 branch from aac9634 to 5f7bac7 Compare May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9304f04 to 0668fa7 Compare May 12, 2026 15:24
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-12-temp_fix_wasm-pack May 12, 2026 15:24
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 12, 2026 15:24 Destroyed
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 0668fa7 to 2c029c2 Compare May 12, 2026 17:46
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 12, 2026 17:46 Destroyed
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2c029c2 to 9b64fa5 Compare May 12, 2026 18:03
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 12, 2026 18:03 Destroyed
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9b64fa5 to 0eb8f51 Compare May 12, 2026 18:44
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 12, 2026 18:44 Destroyed
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch from 8c70217 to 00af8a0 Compare May 13, 2026 20:09
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 13, 2026 20:09 Destroyed
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch 2 times, most recently from ad1ebe6 to 6d4d578 Compare May 13, 2026 20:15
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2258fbe to aa538e5 Compare May 13, 2026 20:16
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 13, 2026 20:16 Destroyed
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from aa538e5 to a70e048 Compare May 14, 2026 18:09
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch from 6d4d578 to d02776c Compare May 14, 2026 18:09
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 14, 2026 18:09 Destroyed
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from a70e048 to 904ee37 Compare May 18, 2026 16:58
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch from d02776c to 2afe212 Compare May 18, 2026 16:58
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 18, 2026 16:58 Destroyed
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 904ee37 to 4480579 Compare May 18, 2026 17:10
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 18, 2026 17:10 Destroyed
@abcxff abcxff changed the title [slop]fix(workflow-engine): only commit step state after success fix(workflow-engine): only commit step state after success May 18, 2026
@abcxff abcxff marked this pull request as ready for review May 18, 2026 19:18
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 4480579 to 5c16781 Compare May 18, 2026 19:23
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 18, 2026 19:23 Destroyed
Copy link
Copy Markdown
Contributor Author

abcxff commented May 19, 2026

Merge activity

  • May 19, 8:06 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 19, 8:08 PM UTC: The Graphite merge of this pull request was cancelled.
  • May 19, 8:08 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 19, 8:12 PM UTC: Graphite rebased this pull request as part of a merge.
  • May 19, 8:13 PM UTC: @abcxff merged this pull request with Graphite.

@abcxff abcxff changed the base branch from 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy to graphite-base/5010 May 19, 2026 20:09
@abcxff abcxff changed the base branch from graphite-base/5010 to main May 19, 2026 20:10
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 5c16781 to 4628cd9 Compare May 19, 2026 20:11
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5010 May 19, 2026 20:11 Destroyed
@abcxff abcxff merged commit c999f19 into main May 19, 2026
11 of 18 checks passed
@abcxff abcxff deleted the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch May 19, 2026 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant